Introduction

Research Question

What is STAG2? What should we expect if we knocked out STAG2 from a cell?

STAG2 is a important subunit in the cohesin complex that plays an important role in regulating sister chromatid alignment during cell division, and other genetic regulatory functions. Studies show that STAG2-mut EWS have higher rates of metastatic disease and worse outcomes. It is expected that the TC71 & A673 cell lines will have differing phenotype effects to STAG2 KO, for example STAG2 deletion will lead to TC71 growth defect but with A673 it will lead to a growth advantage. STAG1 levels could also increase, in order to possibly fill in for the removal of STAG2 in the cell cycle, a decrease of RAD21 is also observed. The cohesin complex also mediates intrachromosomal interactions including those conjoining enhancers to promoters. Loss of STAG2 produces highly consistent and stable transcriptional changes that may undergo selection to confer a competitive advantage. Two neurodvelopmental transcription factors, POU3F2 and NR2F1 were consistently upregulated in STAG2 KO studies. (Adane et al, 2022)

STAG2 and CDKN24 seem to share an exclusive pattern of genetic alterations. cell lines with STAG2 mutations seem likely to express p16 , and reciprocally all cases with CDKN2A deletion seem to express STAG2. (Tirode et al. 2015)

In a study published by the Cancer cell journal, a GSEA enrichment analysis was conducted and from the 18,889 signatures ranked by the average normalized enrichment score (NES), several of the top 20 signatures enriched in STAG2 proficient condition were EWSR1-FLI1-regulated gene signatures. (Surdez et al. 2021)

Data input

All samples are downloaded from the SRA Run Selector for GSE1322966, slight quality control via fastp was conducted and then the reads were aligned using Salmon for Transcript-level quantification files. The metadata for the samples was provided from the SRA Run Selector also. 3 sample comparisons were constructed using the sample data:

  1. SA2 KO vs WT in A673 cells.

  2. SA2 KO vs WT in TC71 cells.

samples <- read.table(file.path("Analysis/SraRunTable.txt"), sep = ",", header = TRUE) %>%
  dplyr::mutate(cell_line = ifelse(grepl("A673", x = source_name), "A673", "TC71")) %>%
  dplyr::mutate(condition = ifelse(grepl("WT|siCT", x = GENOTYPE), "Control", "GeneKO"))

A673_samples <- samples %>%
  dplyr::filter(GENOTYPE %in% c("WT", "SA2 KO") & cell_line == "A673")

TC71_samples <- samples %>%
  dplyr::filter(GENOTYPE %in% c("SA2 KO", "WT") & cell_line == "TC71")

A673_salmon_files <- file.path("Salmon/salmon.out", A673_samples$Run, "quant.sf") %>%
  setNames(object = , A673_samples$Run)

TC71_salmon_files <- file.path("Salmon/salmon.out", TC71_samples$Run, "quant.sf") %>%
  setNames(object = , TC71_samples$Run)

ensdb <- EnsDb.Hsapiens.v86

transcripts <- transcripts(ensdb, columns = c(listColumns(ensdb, "tx"), "gene_name"), return.type = "data.frame") %>%
  as_tibble() %>%
  dplyr::select(tx_id, gene_name)

A673_txi <- tximport(A673_salmon_files, type = "salmon", tx2gene = transcripts, ignoreTxVersion = TRUE)

TC71_txi <- tximport(TC71_salmon_files, type = "salmon", tx2gene = transcripts, ignoreTxVersion = TRUE)

A673_dds_txi <- DESeqDataSetFromTximport(A673_txi, colData = A673_samples, design = ~condition)

TC71_dds_txi <- DESeqDataSetFromTximport(TC71_txi, colData = TC71_samples, design = ~condition)

Exploratory Data Analysis

Sample Metadata

A673_metadata <- colData(A673_dds_txi)
TC71_metadata <- colData(TC71_dds_txi)

sample_metadata <- rbind(A673_metadata, TC71_metadata)

sample_metadata %>%
  as.data.frame() %>%
  dplyr::select(GENOTYPE, cell_line) %>%
  kbl(caption = htmltools::tags$caption("Table 1: Sample Metadata", style = "color:black")) %>%
  kable_styling(bootstrap_options = "striped", full_width = T, html_font = "Cambria")
Table 1: Sample Metadata
GENOTYPE cell_line
SRR9326185 SA2 KO A673
SRR9326186 SA2 KO A673
SRR9326187 SA2 KO A673
SRR9326195 WT A673
SRR9326196 WT A673
SRR9326197 WT A673
SRR9326198 SA2 KO TC71
SRR9326199 SA2 KO TC71
SRR9326200 SA2 KO TC71
SRR9326201 SA2 KO TC71
SRR9326202 SA2 KO TC71
SRR9326203 SA2 KO TC71
SRR9326208 WT TC71
SRR9326209 WT TC71
SRR9326210 WT TC71

PCA Plots

vst <- vst(A673_dds_txi)

A673_PCA <- plotPCA(vst, intgroup = c("cell_line", "GENOTYPE"), returnData = TRUE)
A673_percentvar <- round(100 * attr(A673_PCA, "percentVar"))

ggplot(A673_PCA, aes(PC1, PC2, color = GENOTYPE)) +
  geom_point(size = 3) +
  ggtitle("PCA Plot for A673 Samples") +
  xlab(paste0("PC1: ", A673_percentvar[1], "% variance")) +
  ylab(paste0("PC2: ", A673_percentvar[2], "% variance")) +
  coord_fixed()

TC71_vst <- vst(TC71_dds_txi)

TC71_PCA <- plotPCA(TC71_vst, intgroup = c("cell_line", "GENOTYPE"), returnData = TRUE)
TC71_percentvar <- round(100 * attr(TC71_PCA, "percentVar"))

ggplot(TC71_PCA, aes(PC1, PC2, color = GENOTYPE)) +
  geom_point(size = 3) +
  ggtitle("PCA Plot for TC71 Samples") +
  xlab(paste0("PC1: ", TC71_percentvar[1], "% variance")) +
  ylab(paste0("PC2: ", TC71_percentvar[2], "% variance")) +
  coord_fixed()

Results

Statistical Analysis

Volcano Plots

EnhancedVolcano(A673_result_df,
  lab = A673_result_df$Gene_name,
  x = "log2FoldChange",
  y = "pvalue",
  title = "Volcano Plot for A673 SA2 KO Cells",
  subtitle = "",
  pointSize = 1.0,
  labSize = 4.0,
  xlim = c(min(A673_sig_ordered_result$log2FoldChange), max(A673_sig_ordered_result$log2FoldChange)),
  ylim = c(0, 300)
)

EnhancedVolcano(TC71_result_df,
  lab = TC71_result_df$Gene_name,
  x = "log2FoldChange",
  y = "pvalue",
  title = "Volcano Plot for TC71 SA2 KO Cells",
  subtitle = "",
  pointSize = 1.0,
  labSize = 4.0,
  xlim = c(min(TC71_sig_ordered_result$log2FoldChange), max(TC71_sig_ordered_result$log2FoldChange)),
  ylim = c(0, 250)
)

Significant DEG tables

A673_table_result <- dplyr::select(A673_sig_ordered_result, Gene_name, log2FoldChange, stat, pvalue, padj) %>%
  datatable(
    class = "cell-border stripe",
    caption = htmltools::tags$caption("Table 2: A672 SA2 KO Significant DEGs", style = "color:black"), rownames = FALSE
  )
A673_table_result
TC71_table_result <- dplyr::select(TC71_sig_ordered_result, Gene_name, log2FoldChange, stat, pvalue, padj) %>%
  datatable(
    class = "cell-border stripe",
    caption = htmltools::tags$caption("Table 3: TC71 SA2 KO Significant DEGs", style = "color:black"), rownames = FALSE
  )
TC71_table_result

STAG2 Plot Counts

A673_counts <- plotCounts(A673_dds, gene = "STAG2", returnData = TRUE)

ggplot(A673_counts, aes(x = condition, y = count, color = condition)) +
  geom_point(size = 2) +
  ggtitle("STAG2 Count for A672 SA2 KO vs WT") +
  ylab(label = "STAG2 Expression (reads)") 

TC71_counts <- plotCounts(TC71_dds, gene = "STAG2", returnData = TRUE)

ggplot(TC71_counts, aes(x = condition, y = count, color = condition)) +
  geom_point(size = 2) +
  ggtitle("STAG2 Count for TC71 SA2 KO vs WT") + 
  ylab(label = "STAG2 Expression (reads)") 

Venn Diagram

A673_overexpressed <- A673_sig_ordered_result %>%
  dplyr::filter(log2FoldChange > 0)

A673_underexpressed <- A673_sig_ordered_result %>%
  dplyr::filter(log2FoldChange < 0)

TC71_overexpressed <- TC71_sig_ordered_result %>%
  dplyr::filter(log2FoldChange > 0)

TC71_underexpressed <- TC71_sig_ordered_result %>%
  dplyr::filter(log2FoldChange < 0)

grid.newpage()

over <- venn.diagram(list(A673_overexpressed$Gene_name, TC71_overexpressed$Gene_name),
  category.names = c("A673", "TC71"),
  filename = NULL,
  main = "Overlapping Overexpressed Genes",
  fill = c("red", "deepskyblue4"),
  lwd = 1, lty = 1
)
grid.draw(over)

grid.newpage()

under <- venn.diagram(list(A673_underexpressed$Gene_name, TC71_underexpressed$Gene_name),
  category.names = c("A673", "TC71"),
  filename = NULL,
  main = "Overlapping Underexpressed Genes",
  fill = c("red", "deepskyblue4"),
  lwd = 1, lty = 1
)
grid.draw(under)

4-way plot

A673_list <- A673_result_df %>%
  dplyr::select(log2FoldChange, stat, padj) %>%
  rownames_to_column() %>%
  dplyr::rename("Gene" = rowname, "A673_Log2FC" = log2FoldChange, "A673_stat" = stat, "A673_padj" = padj) %>%
  drop_na()


TC71_list <- TC71_result_df %>%
  dplyr::select(log2FoldChange, stat, padj) %>%
  rownames_to_column() %>%
  dplyr::rename("Gene" = rowname, "TC71_Log2FC" = log2FoldChange, "TC71_stat" = stat, "TC71_padj" = padj) %>%
  drop_na()

fourway_df <- inner_join(A673_list, TC71_list, by = "Gene") %>%
  dplyr::mutate(Sig_Group = case_when(
    A673_padj < 0.05 & TC71_padj < 0.05 ~ "Both",
    A673_padj < 0.05 ~ "A673-only", TC71_padj < 0.05 ~ "TC71-only",
    TRUE ~ "Not Significant"
  ))

dge_fourway <- ggplot(data = fourway_df, aes(x = A673_stat, y = TC71_stat, label = Gene, color = Sig_Group)) +
  geom_point(alpha = .8) +
  geom_hline(yintercept = 0, size = .1) +
  geom_vline(xintercept = 0, size = .1) +
  coord_fixed(ratio = 1) +
  scale_color_manual(values = c(
    "Both" = "blue",
    "A673-only" = "darkgoldenrod2",
    "TC71-only" = "firebrick",
    "Not Significant" = "grey"
  )) +
  labs(title = "Wald stat of Differentialy Expressed Genes in A673 and TC71 Cell lines")

ggplotly(dge_fourway)

Top 10 Over-expressed and 10 Under-expressed Heatmaps

A673_heatmap <- pheatmap(A673_sig_norm_dds_counts,
  main = "Top 10 Over- and Under- expressed A673 STAG2KO DEGs",
  color = palette(200),
  cluster_rows = TRUE,
  cluster_cols = TRUE,
  show_rownames = TRUE,
  annotation = dplyr::select(A673_heat_meta, condition),
  scale = "row"
)

TC71_heatmap <- pheatmap(TC71_sig_norm_dds_counts,
  main = "Top 10 Over- and Under- expressed TC71 STAG2KO DEGs",
  color = palette(200),
  cluster_rows = TRUE,
  cluster_cols = TRUE,
  show_rownames = TRUE,
  annotation = dplyr::select(TC71_heat_meta, condition),
  scale = "row"
)

GSEA Enrichment Analysis

GSEA Pathway Plots

A673_top_ES_plot <- gseaplot2(A673_gsea_res, geneSetID = A673_top_pathways, title = "Top 4 Overexpressed A673 Enrichment plots")

A673_top_ES_plot

A673_bot_ES_plot <- gseaplot2(A673_gsea_res, geneSetID = A673_bot_pathways, title = "Top 4 Underexpressed A673 Enrichment plots")

A673_bot_ES_plot

TC71_top_ES_plot <- gseaplot2(TC71_gsea_res, geneSetID = TC71_top_pathways, title = "Top 4 Overexpressed TC71 Enrichment Plot")

TC71_top_ES_plot

TC71_bot_ES_plot <- gseaplot2(TC71_gsea_res, geneSetID = TC71_bot_pathways, title = "Top 4 Underexpressed TC71 Enrichment Plot")

TC71_bot_ES_plot

GSEA Venn Diagram

A673_overexp_gsea <- A673_gsea_df %>%
  dplyr::filter(NES > 0) 

A673_overexp_gsea_id <- A673_overexp_gsea %>%
  pull(ID)

A673_underexp_gsea <- A673_gsea_df %>%
  dplyr::filter(NES < 0)

A673_underexp_gsea_id <- A673_underexp_gsea %>%
    pull(ID)

TC71_overexp_gsea <- TC71_gsea_df %>%
  dplyr::filter(NES > 0) 

TC71_overexp_gsea_id <- TC71_overexp_gsea %>%
  pull(ID)

TC71_underexp_gsea <- TC71_gsea_df %>%
  dplyr::filter(NES < 0)

TC71_underexp_gsea_id <- TC71_underexp_gsea %>%
  pull(ID)

grid.newpage()

overexp_gene_set <- venn.diagram(list(A673_overexp_gsea$ID, TC71_overexp_gsea$ID),
  category.names = c("A673", "TC71"),
  filename = NULL,
  main = "Overlapping Overexpressed Gene Sets",
  fill = c("red", "deepskyblue4"), 
  lwd = 1, lty =1)

grid.draw(overexp_gene_set)

grid.newpage()

underexp_gene_set <- venn.diagram(list(A673_underexp_gsea$ID, TC71_underexp_gsea$ID),
  category.names = c("A673", "TC71"),
  filename = NULL,
  main = "Overlapping Underexpressed Gene Sets",
  fill = c("red", "deepskyblue4"), 
  lwd = 1, lty =1)

grid.draw(underexp_gene_set)

GSEA Pathway Datatables

overexp_gene_overlap <- intersect(A673_overexp_gsea, TC71_overexp_gsea)

A673_overexp_gsea_df <- A673_gsea_df %>%
  dplyr::filter(NES > 0)

TC71_overexp_gsea_df <- TC71_gsea_df %>%
  dplyr::filter(NES > 0)

"%!in%" <- Negate("%in%")

cell_overexp_gsea_list <- list("A673" = A673_overexp_gsea_df, "TC71" = TC71_overexp_gsea_df)

overexp_genes_data <- lapply(cell_overexp_gsea_list, function(x) {
  (x[x$ID %!in% overexp_gene_overlap, ] %>%
    rownames_to_column("gene_set") %>%
    dplyr::arrange(NES) %>%
    dplyr::select(c(gene_set, NES, pvalue, p.adjust)) %>%
    dplyr::arrange(desc(NES)))
})

overexp_genes_data$A673 %>%
  datatable(class = "cell-border stripe", caption = htmltools::tags$caption("Table 4: A673 GSEA Overexpressed Gene Sets", style = "color:black"), rownames = FALSE)
overexp_genes_data$TC71 %>%
  datatable(class = "cell-border stripe", caption = htmltools::tags$caption("Table 5: TC71 GSEA Overexpressed Gene Sets", style = "color:black"), rownames = FALSE)
shared_overexp_genes <- inner_join(A673_overexp_gsea_df, TC71_overexp_gsea_df, by = "ID") %>%
  dplyr::mutate(meanrank = rowMeans(cbind(rank.x, rank.y))) %>%
  dplyr::arrange(meanrank) %>%
  dplyr::select(starts_with(c("ID", "NES", "p.adjust"))) %>%
  dplyr::rename("gene_sets" = "ID")

names(shared_overexp_genes) <- gsub("*.y", ".TC71", names(shared_overexp_genes))
names(shared_overexp_genes) <- gsub("*.x", ".A673", names(shared_overexp_genes))

datatable(shared_overexp_genes, class = "cell-border stripe", caption = htmltools::tags$caption("Table 6: Shared Overexpressed Genesets for Both Cell Lines", style = "color:black"), rownames = FALSE)
underexp_path_overlap <- intersect(A673_underexp_gsea, TC71_underexp_gsea)

A673_underexp_gsea_df <- A673_gsea_df %>%
  dplyr::filter(NES < 0)

TC71_underexp_gsea_df <- TC71_gsea_df %>%
  dplyr::filter(NES < 0)

"%!in%" <- Negate("%in%")

cell_underexp_gsea_list <- list("A673" = A673_underexp_gsea_df, "TC71" = TC71_underexp_gsea_df)

underexp_genes_data <- lapply(cell_underexp_gsea_list, function(x) {
  (x[x$ID %!in% underexp_path_overlap, ] %>%
    rownames_to_column("gene_set") %>%
    dplyr::arrange(NES) %>%
    dplyr::select(c(gene_set, NES, pvalue, p.adjust)) %>%
    dplyr::arrange(desc(NES)))
})

underexp_genes_data$A673 %>%
  datatable(class = "cell-border stripe", caption = htmltools::tags$caption("Table 7: A673 GSEA Underexpressed Genesets", style = "color:black"), rownames = FALSE)
underexp_genes_data$TC71 %>%
  datatable(class = "cell-border stripe", caption = htmltools::tags$caption("Table 8: TC71 GSEA Underexpressed Genesets", style = "color:black"), rownames = FALSE)
shared_underexp_genes <- inner_join(A673_underexp_gsea_df, TC71_underexp_gsea_df, by = "ID") %>%
  dplyr::mutate(meanrank = rowMeans(cbind(rank.x, rank.y))) %>%
  dplyr::arrange(meanrank) %>%
  dplyr::select(starts_with(c("ID", "NES", "p.adjust"))) %>%
  dplyr::rename("gene_sets" = "ID")

names(shared_underexp_genes) <- gsub("*.y", ".TC71", names(shared_underexp_genes))
names(shared_underexp_genes) <- gsub("*.x", ".A673", names(shared_underexp_genes))

datatable(shared_underexp_genes, class = "cell-border stripe", caption = htmltools::tags$caption("Table 9: Shared Underexpressed Genesets for Both Cell Lines", style = "color:black"), rownames = FALSE)

Enrichr Pathway Analysis

A673 enrichr Pathway enrichment

A673_resRmd <- llply(names(A673_gene_list), function(groupNow) {
  genesNow <- A673_gene_list[[groupNow]]
  response <- httr::POST(
    url = "https://maayanlab.cloud/Enrichr/addList",
    body = list(
      "list" = paste0(genesNow, collapse = "\n"),
      "description" = groupNow
    )
  )
  response <- jsonlite::fromJSON(httr::content(response, as = "text"))
  permalink <- paste0(
    "https://maayanlab.cloud/Enrichr/enrich?dataset=",
    response$shortId[1]
  )
  knitr::knit_child(
    text = c(
      "#### `r groupNow`",
      "",
      'Enrichr Link: <a href="`r permalink`" target="_blank">`r groupNow`</a>.',
      ""
    ),
    envir = environment(),
    quiet = TRUE
  )
})
cat(unlist(A673_resRmd), sep = "\n")

Over-expressed

Enrichr Link: Over-expressed.

Under-expressed

Enrichr Link: Under-expressed.

TC71 enrichr Pathway enrichment

TC71_resRmd <- llply(names(TC71_gene_list), function(groupNow) {
  genesNow <- TC71_gene_list[[groupNow]]
  response <- httr::POST(
    url = "https://maayanlab.cloud/Enrichr/addList",
    body = list(
      "list" = paste0(genesNow, collapse = "\n"),
      "description" = groupNow
    )
  )
  response <- jsonlite::fromJSON(httr::content(response, as = "text"))
  permalink <- paste0(
    "https://maayanlab.cloud/Enrichr/enrich?dataset=",
    response$shortId[1]
  )
  knitr::knit_child(
    text = c(
      "#### `r groupNow`",
      "",
      'Enrichr Link: <a href="`r permalink`" target="_blank">`r groupNow`</a>.',
      ""
    ),
    envir = environment(),
    quiet = TRUE
  )
})
cat(unlist(TC71_resRmd), sep = "\n")

Over-expressed

Enrichr Link: Over-expressed.

Under-expressed

Enrichr Link: Under-expressed.

Discussion

How do we know that the STAG2 KO treatment worked?

The PlotCount graphs for our cell lines show that the WT controls do express higher counts of STAG2, however the intensity of the change is different. The TC71 treatment cells had a far more extreme decrease in STAG2 levels compared to a relatively light change in A673 treatment cells.

What SA2 KO DEGs are specific to TC71? A673? and which are shared between the two?

The Over- and Under- expressed genes Venn diagrams give us a broad idea of what genes are specific to their cell line and shared between the two. The 4-way plot gives us a closer look at the genes in the samples, we can see what genes are over- and under - expressed for both cell lines and specific cell lines. The genes on the edge of the graph match up to genes listed on the heatmaps.

Any Notable results?

With Assistance from Henry, we saw something of note in the enrichr results for the over-expressed genes in the TC71 cell line. In the transcription section under TF Perturbations Followed by Expression, one of the results of note is FLI1 KD HUMAN GSE27524 CREEDSID GENE 1602 DOWN This is something we can use to begin an discussion for the relationship between STAG2 and Ewing sarcoma’s primary gene EWS-FLI1.

GSEA uses all the genes in our samples for analysis, while enrichr uses only the significant (Padj > 0.05) genes. Why?

GSEA can use all the sample’s genes because it doesn’t use arbitrary cutoffs for differential expression significance. The genes located around 0 for the stat value are viewed as insignificant compared to both positive and negative ends. However for our enrichr API set up we use the significant DEGs for our analysis because we set up the API to use crisp input so the genes we are providing need to overlap because no additional changes are being conducted before the enrichment analysis.

Works Cited

Adane B, Alexe G, Seong BKA, Lu D, Hwang EE, Hnisz D, Lareau CA, Ross L, Lin S, Dela Cruz FS, Richardson , Weintraub AS, Wang S, Iniguez AB, Dharia NV, Conway AS, Robichaud AL, Tanenbaum B, Krill-Burger JM, Vazquez F, Schenone M, Berman JN, Kung AL, Carr SA, Aryee MJ, Young RA, Crompton BD, Stegmaier K. STAG2 loss rewires oncogenic and developmental programs to promote metastasis in Ewing sarcoma. Cancer Cell. 2021 Jun 14;39(6):827-844.e10. doi: 10.1016/j.ccell.2021.05.007. PMID: 34129824; PMCID: PMC8378827.

Surdez D, Zaidi S, Grossetête S, Laud-Duval K, Ferre AS, Mous L, Vourc’h T, Tirode F, Pierron G, Raynal V, Baulande S, Brunet E, Hill V, Delattre O. STAG2 mutations alter CTCF-anchored loop extrusion, reduce cis-regulatory interactions and EWSR1-FLI1 activity in Ewing sarcoma. Cancer Cell. 2021 Jun 14;39(6):810-826.e9. doi: 10.1016/j.ccell.2021.04.001. Epub 2021 Apr 29. PMID: 33930311.

Tirode F, Surdez D, Ma X, Parker M, Le Deley MC, Bahrami A, Zhang Z, Lapouble E, Grossetête-Lalami S, Rusch M, Reynaud S, Rio-Frio T, Hedlund E, Wu G, Chen X, Pierron G, Oberlin O, Zaidi S, Lemmon G, Gupta P, Vadodaria B, Easton J, Gut M, Ding L, Mardis ER, Wilson RK, Shurtleff S, Laurence V, Michon J, Marec-Bérard P, Gut I, Downing J, Dyer M, Zhang J, Delattre O; St. Jude Children’s Research Hospital–Washington University Pediatric Cancer Genome Project and the International Cancer Genome Consortium. Genomic landscape of Ewing sarcoma defines an aggressive subtype with co-association of STAG2 and TP53 mutations. Cancer Discov. 2014 Nov;4(11):1342-53. doi: 10.1158/2159-8290.CD-14-0622. Epub 2014 Sep 15. PMID: 25223734; PMCID: PMC4264969.